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We propose a new penalized method for variable selection and es- 
timation that explicitly incorporates the correlation patterns among 
predictors. This method is based on a combination of the minimax 
concave penalty and Laplacian quadratic associated with a graph as 
the penalty function. We call it the sparse Laplacian shrinkage (SLS) 
method. The SLS uses the minimax concave penalty for encouraging 
sparsity and Laplacian quadratic penalty for promoting smoothness 
among coefficients associated with the correlated predictors. The SLS 
has a generalized grouping property with respect to the graph repre- 
sented by the Laplacian quadratic. We show that the SLS possesses 
an oracle property in the sense that it is selection consistent and 
equal to the oracle Laplacian shrinkage estimator with high prob- 
ability. This result holds in sparse, high-dimensional settings with 

n under reasonable conditions. We derive a coordinate descent 
algorithm for computing the SLS estimates. Simulation studies are 
conducted to evaluate the performance of the SLS method and a real 
data example is used to illustrate its application. 

1. Introduction. There has been much work on penahzed methods for 
variable selection and estimation in high-dimensional regression models. Sev- 
eral important methods have been proposed. Examples include estimators 
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based on the bridge penalty [Frank and Friedman (1993)], the ii penalty 
or the least absolute shrinkage and selection operator [LASSO, Tibshirani 
(1996), Chen, Donoho and Saunders (1998)], the smoothly clipped absolute 
deviation (SCAD) penalty [Fan (1997), Fan and Li (2001)] and the min- 
imum concave penalty [MCP, Zhang (2010)]. These methods are able to 
do estimation and automatic variable selection simultaneously and provide 
a computationally feasible way for variable selection in high-dimensional 
settings. Much progress has been made in understanding the theoretical 
properties of these methods. Efficient algorithms have also been developed 
for implementing these methods. 

A common feature of the methods mentioned above is the independence 
between the penalty and the correlation among predictors. This can lead to 
unsatisfactory selection results, especially in n settings. For example, 
as pointed out by Zou and Hastie (2005), the LASSO tends to only select 
one variable among a group of highly correlated variables; and its prediction 
performance may not be as good as the ridge regression if there exists high 
correlation among predictors. To overcome these limitations, Zou and Hastie 
(2005) proposed the elastic net (Enet) method, which uses a combination of 
the ii and £2 penalties. Selection properties of the Enet and adaptive Enet 
have also been studied by Jia and Yu (2010) and Zou and Zhang (2009). 
Bondell and Reich (2008) proposed the OSCAR (octagonal shrinkage and 
clustering algorithm for regression) approach, which uses a combination of 
the ii norm and a pairwise £00 norm for the coefficients. Huang et al. (2010a) 
proposed the Mnet method, which uses a combination of the MCP and £2 
penalties. The Mnet estimator is equal to the oracle ridge estimator with 
high probability under certain conditions. These methods are effective in 
dealing with certain types of collinearity among predictors and has the use- 
ful grouping property of selecting and dropping highly correlated predictors 
together. Still, these combination penalties do not use any specific informa- 
tion on the correlation pattern among the predictors. 

Li and Li (2008) proposed a network-constrained regularization proce- 
dure for variable selection and estimation in linear regression models, where 
the predictors are genomic data measured on genetic networks. Li and Li 
(2010) considered the general problem of regression analysis when predic- 
tors are measured on an undirected graph, which is assumed to be known 
a priori. They called their method a graph-constrained estimation procedure 
or GRACE. The GRACE penalty is a combination of the ii penalty and 
a penalty that is the Laplacian quadratic associated with the graph. Because 
the GRACE uses the ii penalty for selection and sparsity, it has the same 
drawbacks as the Enet discussed above. In addition, the full knowledge of the 
graphical structure for the predictors is usually not available, especially in 
high-dimensional problems. Daye and Jeng (2009) proposed the weighted fu- 
sion method, which also uses a combination of the ii penalty and a quadratic 
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form that can incorporate information among correlated variables for esti- 
mation and variable selection. Tutz and Ulbricht (2009) studied a form of 
correlation based penalty, which can be considered a special case of the gen- 
eral quadratic penalty. But this approach does not do variable selection. 
The authors proposed a blockwise boosting procedure in combination with 
the correlation based penalty for variable selection. Hebiri and van de Geer 
(2010) studied the theoretical properties of the smoothed-Lasso and other 
h + -^2 -penalized methods in p» n models. Pan, Xie and Shen (2011) stud- 
ied a grouped penalty based on the L^-norm for 7 > 1 that smoothes the 
regression coefficients over a network. In particular, when 7 = 2 and after 
appropriate rescaling of the regression coefficients, this group penalty 
simplifies to the group Lasso [Yuan and Lin (2006)] with the nodes in the 
network as groups. This method is capable of group selection, but it does 
not do individual variable selection. Also, because the group penalty is 
convex for 7 > 1, it does not lead to consistent variable selection, even at 
the group level. 

We propose a new penalized method for variable selection and estimation 
in sparse, high-dimensional settings that takes into account certain correla- 
tion patterns among predictors. We consider a combination of the MCP and 
Laplacian quadratic as the penalty function. We call the proposed approach 
the sparse Laplacian shrinkage (SLS) method. The SLS uses the MCP to 
promote sparsity and Laplacian quadratic penalty to encourage smoothness 
among coefficients associated with the correlated predictors. An important 
advantage of the MCP over the ii penalty is that it leads to estimators that 
are nearly unbiased and achieve selection consistency under weaker condi- 
tions [Zhang (2010)]. 

The contributions of this paper are as follows. 

• First, unlike the existing methods that use an £1 penalty for selection 
and a ridge penalty or a general £2 penalty for dealing with correlated 
predictors, we use the MCP to achieve nearly unbiased selection and pro- 
posed a concrete class of quadratics, the Laplacians, for incorporating 
correlation patterns among predictors in a local fashion. In particular, 
we suggest to employ the approaches for network analysis for specifying 
the Laplacians. This provides an implementable strategy for incorporating 
correlation structures in high-dimensional data analysis. 

• Second, we prove that the SLS estimator is sign consistent and equal 
to the oracle Laplacian shrinkage estimator under reasonable conditions. 
This result holds for a large class of Laplacian quadratics. An important 
aspect of this result is that it allows the number of predictors to be larger 
than the sample size. In contrast, the works of Daye and Jeng (2009) and 
Tutz and Ulbricht (2009) do not contain such results in p 3> n models. 
The selection consistency result of Hebiri and van de Geer (2010) requires 
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certain strong assumptions on the magnitude of the smahest regression 
coefficient (their Assumption C) and on the correlation between impor- 
tant and unimportant predictors (their Assumption D), in addition to 
a variant of the restricted eigenvalue condition (their Assumption B). In 
comparison, our assumption involving the magnitude of the regression 
coefficients is weaker and we use a sparse Riese condition instead of im- 
posing restriction on the correlations among predictors. In addition, our 
selection results are stronger in that the SLS estimator is not only sign 
consistent, but also equal to the oracle Laplacian shrinkage estimator with 
high probability. In general, similar results are not available with the use 
of the ii penalty. 

• Third, we show that the SLS method is potentially capable of incorporat- 
ing correlation structure in the analysis without incurring extra bias. The 
Enet and the more general ii + £2 methods in general introduces extra 
bias due to the quadratic penalty, in addition to the bias resulting from 
the ii penalty. To the best of our knowledge, this point has not been dis- 
cussed in the existing literature. We also demonstrate that the SLS has 
certain local smoothing property with respect to the graphical structure 
of the predictors. 

• Fourth, unlike in the GRACE method, the SLS does not assume that the 
graphical structure for the predictors is known a priori. The SLS uses 
the existing data to construct the graph Laplacian or to augment partial 
knowledge of the graph structure. 

• Fifth, our simulation studies demonstrate that the SLS method outper- 
forms the ii penalty plus a quadratic penalty approach as studied in Daye 
and Jeng (2009) and Hebiri and van de Geer (2010). In our simulation ex- 
amples, the SLS in general has smaller empirical false discovery rates with 
comparable false negative rates. It also has smaller prediction errors. 

This paper is organized as follows. In Section 2, we define the SLS es- 
timator. In Section 3 we discuss ways to construct graph Laplacian, or 
equivalently, its corresponding adjacency matrix. In Section 4, we study 
the selection properties of the SLS estimators. In Section 5, we investigate 
the properties of Laplacian shrinkage. In Section 6, we describe a coordinate 
descent algorithm for computing the SLS estimators, present simulation re- 
sults and an application of the SLS method to a microarray gene expression 
dataset. Discussions of the proposed method and results are given in Sec- 
tion 7. Proofs for the oracle properties of the SLS and other technical details 
are provided in the Appendix. 

2. The sparse Laplacian shrinkage estimator. Consider the linear regres- 
sion model 
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with n observations and p potential predictors, where y = (yi, . . . , is the 
vector of n response variables, Xj = (xy , . . . , Xnj)' is the jth predictor, /3j 
is the jth regression coefficient and e = (ei, . . . ,en)' is the vector of random 
errors. Let X = (xi,...,Xp) be the n x p design matrix. Throughout, we 
assume that the response and predictors are centered and the predictors are 
standardized so that XlILi ^Ij = = 1; • • • ^P- For A = (Ai, A2) with Ai > 
and A2 > 0, we propose the penalized least squares criterion 

M(b;A,7) = ^||y-Xb||2 + ^p(|6,.|;Ai,7) 

(2.2) ^ 

+ 2-^2 ^ \ajk\{bj - Sjkbkf, 

l<j<k<p 

where || • || denotes the £2 norm, p is the MCP with penalty parameter Ai and 
regularization parameter 7, \ajk\ measures the strength of the connection be- 
tween Xj and Xfc, and Sjk = sgn(ajfc) is the sign of Ojfc, with sgn(t) = —1,0 
or 1, respectively, for t < 0, = or > 0. The two penalty terms in (2.2) play 
different roles. The first term promotes sparsity in the estimated model. The 
second term encourages smoothness of the estimated coefficients of the con- 
nected predictors. We can associate the quadratic form in this term with the 
Laplacian for a suitably defined undirected weighted graph for the predic- 
tors. See the description below. For any given (A, 7), the SLS estimator is 

(2.3) ^(A,7) = argminM(b;A,7). 

b 

The SLS uses the MCP, defined as 

r\t\ 

(2.4) p(t;Ai,7) = Ai / {l-x/{-fXi))_,dx, 

Jo 

where for any a G R, a+ is the nonnegative part of a, that is, a+ = al^a>o}- 
The MCP can be easily understood by considering its derivative, 

(2.5) /5(i;Ai,7) = Ai(l - |t|/(7Ai))+sgn(t). 

We observe that the MCP begins by applying the same level of penalization 
as the £1 penalty, but continuously reduces that level to for \t\ > 7A. The 
regularization parameter 7 controls the degree of concavity. Larger values 
of 7 make p less concave. By sliding the value of 7 from 1 to cxo, the MCP 
provides a continuum of penalties with the hard-threshold penalty as 7 — t- 1-|- 
and the convex ii penalty at 7 = 00. Detailed discussion of MCP can be 
found in Zhang (2010). 

The SLS also allows the use of different penalties than the MCP for p, 
including the SCAD [Fan (1997), Fan and Li (2001)] and other quadratic 
splines. Because the MCP minimizes the maximum concavity measure and 
has the simplest form among nearly unbiased penalties in this family, we 
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choose it as the default penalty for the SLS. Further discussion of the MCP 
and its comparison with the LASSO and SCAD can be found in Zhang 
(2010) and Mazumder, Friedman and Hastie (2009). 

We express the nonnegative quadratic form in the second penalty term 
in (2.2) using a positive semi-definite matrix L, which satisfies 

b'Lb = \ajk\{bj - Sjkhkf Mh^W. 

l<j<k<p 

For simplicity, we confine our discussion to the symmetric case where akj = 
o-jkA 1^ j < k < p. Since the diagonal elements ajj do not appear in the 
quadratic form, we can define them any way we like for convenience. Let 
A = {ajk, I < j,k <p) and D = diag((ii, . . . , dp), where dj = Yl^=i \<J-jk\- We 
have Y2i<j<k<p I'^jklibj — sjkbk)'^ = h'{D — A)h. Therefore, L = D — A. This 
matrix is associated with a labeled weighted graph Q = {V,£) with vertex 
set V = {1,. . . ,p} and edge set £ = {(j, k) : (j, k) £V x V}. Here the \ajk\ 
is the weight of edge {j,k) and dj is the degree of vertex j. The dj is also 
called the connectivity of vertex j . The matrix L is called the Laplacian of Q 
and A its signed adjacency matrix [Chung (1997)]. The edge {j,k) is labeled 
with the "+" or "— " sign, but its weight \ajk\ is always nonnegative. We 
use a labeled graph to accommodate the case where two predictors can have 
a nonzero adjacency coefficient but are negatively correlated. Note that the 
usual adjacency matrix can be considered a special case of signed adjacency 
matrix when all aj^ > 0. For simplicity, we will use the term adjacency 
matrix below. 

We usually require that the adjacency matrix to be sparse in the sense that 
many of its entries are zero or nearly zero. With a sparse adjacency matrix, 
the main characteristic of the shrinkage induced by the Laplacian penalty 
is that it occurs locally for the coefficients associated with the predictors 
connected in the graph. Intuitively, this can be seen by writing 

^2 ^ \ajk\{bj - ajkhk)"^ = ^X2 ^ \ajk\{bj - sjkbk)'^ . 

l<j<k<p {j,k):ajk¥^0 

Thus for A2 > 0, the Laplacian penalty shrinks bj — Sj^b^ toward zero for 
ci-jk 7^ 0. This can also be considered as a type of local smoothing on the 
graph Q associated with the adjacency matrix A. In comparison, the shrink- 
age induced by the ridge penalty used in the Enet is global in that it shrinks 
all the coefficients toward zero, regardless of the correlation structure among 
the predictors. We will discuss the Laplacian shrinkage in more detail in Sec- 
tion 5. 

Using the matrix notation, the SLS criterion (2.2) can be written as 
(2.6) M(b; A,7) = ^lly " Xhf + Y^p{\b,\; X,,j) + -Ash' (I) - ^)b. 
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Here the Laplacian is not normalized, meaning that the weight dj is not 
standardized to 1. In problems where predictors should be treated without 
preference with respect to connectivity, we can first normalized the Laplacian 
L* =Ip- A* with A* = D~^/'^AD~^/'^ and use the criterion 

1 ^ 1 

M*(b; A,7) = ^lly - Xhf + Ai,7) + -\2h'% - A*)h. 

Technically, a normalized Laplacian L* can be considered a special case of 
a general L. We only consider the SLS estimator based on the criterion (2.6) 
when studying its properties. In network analysis of gene expression data, 
genes with large connectivity also tend to have important biological func- 
tions [Zhang and Horvath (2005)]. Therefore, it is prudent to provide more 
protection for such genes in the selection process. 

3. Construction of adjacency matrix. In this section, we describe sev- 
eral simple forms of adjacency measures proposed by Zhang and Horvath 
(2005), which have have been successfully used in network analysis of gene 
expression data. The adjacency measure is often defined based on the notion 
of dissimilarity or similarity. 

(i) A basic and widely used dissimilarity measure is the Euclidean dis- 
tance. Based on this distance, we can define adjacency coefficient as ajk = 
0(||xj — Xfe||/-y/n), where (f): [0,oo) i— t- [0,cx3). A simple adjacency function is 
the threshold function 0(x) = l{x < 2r}. Then 

/qiN „ if ||xj -Xfc||/V^< 2r, 

^ ' "^■'=-\0, if ||x,-Xfc||/V^>2r. 

It is convenient to express ajk in terms of the Pearson's correlation coeffi- 
cient Tjk between Xj and x^, where rjk = XjXfc/(||xj||||xfc||). For predictors 
that are standardized with ||xj |p = n, 1 < j < p, we have ||xj — x^p/n = 2 — 
2rjfc. Thus in terms of correlation coefficients, we can write ajj. = ^{rjk > r}- 
We determine the value of r based on the Fisher transformation Zjk = 
0.51og((l + rjfc)/(l — rjk)). If the correlation between Xj and x^ is zero, 
\Jn — 3zjk is approximately distributed as A^(0,1). We can use this to de- 
termine a threshold c for \/n — 3zjk ■ The corresponding threshold for rjk is 
r = (exp(2c/Vra^) - l)/(exp(2c/Vn^) + 1). 

We note that here we use the Fisher transformation to change the scale of 
the correlation coefficients from [—1, 1] to the normal scale for determining 
the threshold value r, so that the adjacency matrix is relatively sparse. We 
are not trying to test the significance of correlation coefficients. 

(ii) The adjacency coefficient in (3.1) is defined based on a dissimilar- 
ity measure. Adjacency coefficient can also be defined based on similarity 
measures. An often used similarity measure is Pearson's correlation coef- 
ficient rjk- Other correlation measures such as Spearman's correlation can 
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also be used. Let 

Sjk=sgn{rjk) and ajj, = Sjkl{\rjk\ > r}. 

Here r can be determined using the Fisher transformation as above. 

(iii) With the power adjacency function considered in Zhang and Horvath 
(2005), 

Ujk = max(0, ^j-fc)" and Sjk = 1. 

Here a > and can be determined by, for example, the scale-free topology 
criterion. 

(iv) A variation of the above power adjacency function is 

ajk = kifcT and Sjk = sgn{rjk). 

For the adjacency matrices given above, (i) and (ii) use dichotomized mea- 
sures, whereas (iii) and (iv) use continuous measures. Under (i) and (iii), 
two covariates are either positively or not connected/correlated. In con- 
trast, under (ii) and (iv), two covariates are allowed to be negatively con- 
nected/ correlated . 

There are many other ways for constructing an adjacency matrix. For ex- 
ample, a popular adjacency measure in cluster analysis is ajk = exp(— ||xj — 
Xfcp/nr^) for r > 0. The resulting adjacency matrix A = [ajk] is the Gram 
matrix associated with the Gaussian kernel. For discrete covariates, the Pear- 
son correlation coefficient can still be used as a measure of correlation or 
association between two discrete predictors or between a discrete predictor 
and a continuous one. For example, for single nucleotide polymorphism data, 
Pearson's correlation coefficient is often used as a measure of linkage dise- 
quilibrium (i.e., association) between two markers. Other measures, such as 
odds ratio or measure of association based on contingency table can also be 
used for r^fc. 

We note that how to construct the adjacency matrix is problem specific. 
Different applications may require different adjacency matrices. Since con- 
struction of adjacency matrix is not the focus of the present paper, we will 
only consider the use of the four adjacency matrices described above in our 
numerical studies in Section 6. 



4. Oracle properties. In this section, we study the theoretical properties 
of the SLS estimator. Let the true value of the regression coefficient be 
13° = {/31, . . .,/3°y. Denote O = {j : /3° / 0}, which is the set of indices of 
nonzero coefficients. Let d° = \0\ be the cardinality of O. Define 

(4.1) ^°(A2) = argminj ^||y - Xbf + ^Aab'Lb, b, =0,j^Oy 

This is the oracle Laplacian shrinkage estimator on the set O. Theorems 1 
and 2 below provide sufficient conditions under which P(sgn(/9) ^ sgn(/3°) 
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or /3 7^ (3°) — ^ 0. Thus, under those conditions, the SLS estimator is sign 
consistent and equal to /3° with high probabihty. 

We need the following notation in stating our results. Let S = n^^X'X. 
For any AuB C {1, . . . vectors v, the design matrix X and V = {vij)pxp, 
define 

= {vj,j € B)', XB = {^j,j eB), 
Va,b = {vij,i^ A,j G Vb = Vb,b- 

For example, T,b = X'^Xb/ti and T,c){^2) = '^o + ^2^0- Let \B\ denote the 
cardinality of B. Let Cmin(A2) be the smallest eigenvalue of S + X2L. We use 
the following constants to bound the bias of the Laplacian: 

(4.2) _^ 

C2 = ||{Sc)c o(A2)Sq {X2)Lo — Lo':,o}l3o\\oo- 

We make the following sub-Gaussian assumption on the error terms in (2.1). 

Condition (A). For a certain constant e G (0,1/3), 

sup P{u'e > at} < e"*'/^ 0<t< y^2log{p/e). 

M=i 

4.1. Convex penalized loss. We first consider the case where S(A2) = 
S + X2L is positive definite. Since (4.1) is the minimizer of the Laplacian 
restricted to the support O, it can be explicitly written as 

(4.3) = (So + X2Lo)-'X'ay/n, = 0, 

provided that '^0(^2) is invertible. Its expectation f3* = E(3°, considered as 
a target of the SLS estimator, must satisfy 

(4.4) (3*a = {^o + X2Lo)-'^o(3", /3^c = 0. 

Condition (B). (i) Cmin(A2) > I/7 with p(t;Ai,7) in (2-2). 

(ii) The penalty levels satisfy 

Ai > A2C2 + 0- J2 log((p - d°)/€) max ||x,- ||/n 
^ j<p 

with C2 in (4.2). 

(iii) With {vj,j G O} being the diagonal elements of Y,q^ {X2)T,o{'^q^ {X2)} , 

mm{\/3*\{n/v,y/^} > ay^2hi(¥/e}. 

Define /3* = min{|/3?|,j G O}. If O is an empty set, that is, when all the 
regression coefficients are zero, we set /3* = 00. 
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Theorem 1. Suppose Conditions (A) and (B) hold. Then 

(4.5) P({i:/3, /0}/O or^/^°)<3e. 

If f^*^ A2C1 + maxj {2vj/n) log(d°/e) instead of Condition (B){m), then 

(4.6) P(sgn(^) / sgn(/3°) or /3 / ^'') < 3e. 

Here note that p,d°,^ and Cmin(A2) are all allowed to depend on n. 

The probability bound on the selection error in Theorem 1 is nonasymp- 
totic. If the conditions of Theorem 1 hold with e — )• 0, then (4.5) implies se- 
lection consistency of the SLS estimator and (4.6) implies sign consistency. 
The conditions are mild. Condition (A) concerns the tail probabilities of 
the error distribution and is satisfied if the errors are normally distributed. 
Condition (B)(i) ensures that the SLS criterion is strictly convex so that the 
solution is unique. The oracle estimator /3° is biased due to the Laplacian 
shrinkage. Condition (B)(ii) requires a penalty level Ai to prevent this bias 
and noise to cause false selection of variables in O'^. Condition (B)(iii) re- 
quires that the nonzero coefficients not be too small in order for the SLS 
estimator to be able to distinguish nonzero from zero coefficients. 

In Theorem 1, we only require Cmin(A2) > 0, or equivalently, S + to be 
positive definite. The matrix S can be singular. This can be seen as follows. 
The adjacency matrix partitions the graph into disconnected cliques Vg, 
^ ^ 9 ^ J, for some J > 1. Let node jg be a (representative) member of Vg. 
A node k belongs to the same clique Vg iff (if and only if) ajgfciafcifc2 ' " ' "^femfc 7^ 
through a certain chain —)• fci —)• /c2 —)•••• —)• km — )• k. Define = 
T.keVg^jaki^kik2---akmk^k/\Vg\, where \Vg\ is the cardinality of Vg. The 
matrix S -|- X2L is positive definite iff b'Sb = b'Lb = implies b = 0. Since 
b'Lb = implies X^^g^g ^k^k = ^jg\^g\^gj ^ + -^2^ is positive definite iff the 
vectors 5cg are linearly independent. This does not require n>p. In other 
words, Theorem 1 is applicable to p> n problems as long as the vectors 
are linearly independent. 

4.2. The nonconvex case. When S(A2) = S -|- A2L is singular, Theorem 1 
is not applicable. In this case, further conditions are required for the oracle 
property to hold. The key condition needed is the sparse Reisz condition, or 
SRC [Zhang and Huang (2008)], in (4.9) below. It restricts the spectrum of 
diagonal subblocks of 5](A2) up to a certain dimension. 

Let X = X{\2) be a matrix satisfying X'X/n = S(A2) = X'X/n + X2L 
and y = y(A2) be a vector satisfying X'y = X'y. Define 

(4.7) M(b;A,7) = — ||y-^bf + ^p(|6,|;Ai,7). 
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Since M(b;A,7) -M{h;X,j) = (||y|p - ||y|p)/(2n), the two penalized loss 
functions have the same set of local minimizers. For the penalized loss (4.7) 
with the data {X,y), let 

(4.8) ^(A) = 5(X(A2),y(A2),Ai), 

where the map S{X,y,Xi) G W defines the MC+ estimator [Zhang (2010)] 
with data {X,y) and penalty level Ai. It was shown in Zhang (2010) that 
5(X,y,Xi) depends on {X,y) only through X'y/n and X'X/n, so that 
different choices of X and y are allowed. One way is to pick y = (y', 0)' and 
X = diag(X, (nA2-J/)^^^). Another way is to pick X X /n = S(A2) and y = 
(X')^X'y of smaller dimensions, where {X')'^ is the Moore-Penrose inverse 
of X'. 

Condition (C). (i) For an integer d* and spectrum bounds < c*(A2) < 
C*(A2) < oo, 

< c,(A2) < u'^Sb(A2)ub < c*(A2) < oo 

(4.9) 

MB with \B\JO\<d*, ||ub|| = 1, 

with d* > d°{K, + 1), 7 > c-i(A2)^4 + c,(A2)/c*(A2) in (2.2), and = 
c*(A2)/c,(A2)-l/2. 

(ii) With C2 = \\{^B,o{^2)^oH^2)Lo - Lb,o}(3o\\oo, 

max{l, c*(A2)-fsr*/(i^* + l)}Ai > A2C2 + ay^2log{p/e) max ||xJ/n. 

j<p 

(iii) With {vj,j G O} being the diagonal elements oi'SQ^{X2)^o{'^o^i^2)}, 
mm{\f3*\ - 7(27?(A^Ai)}(n/^;,f /2 > ay^llogid" /e). 

Theorem 2. (i) Suppose Conditions (A) and (C) hold. Let P{X) be as 
in (4-8). Then 

(4.10) P({i : / 0} / O or ^ / P°) < 3e. 

If P* > A2C1 + 7(2y^ c*(A2)Ai) + maxj ^y {2vj/n) log{d°/e) instead of Condi- 
tion (C){m), then 

(4.11) P(sgn(^) ^ sgn{(3°) or P ^ /3°) < 3e. 

Here note thatp, 7, d°, d* , K^, e, c*(A2) andc*{X2) are all allowed to depend 
on n, including the case c*(A2) — )• as long as the conditions hold as stated. 

(ii) The statements in (i) also hold for all local minimizers j3 of (2.6) 
or (4.7) satisfying #{j ^ O : ^ 0} + d" < d* . 

If the conditions of Theorem 2 hold with e — )• 0, then (4.10) implies selec- 
tion consistency of the SLS estimator and (4.11) implies sign consistency. 
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Condition (C), designed to handle the noncovexity of the penahzed loss, 
is a weaker version of Condition (B) in the sense of allowing singular S(A2). 
The SRC (4.9), depending on X or X only through the regularized Gram 
matrix X' X /n = S(A2) = S + ensures that the model is identifiable 
in a lower d*-dimensional space. When p > n, the smallest singular value 
of X is always zero. However, the requirement c*(A2) > only concerns 
d* X d* diagonal submatrices of S(A2), not the Gram matrix S of the design 
matrix X. We can have n but still require d* /d° > K^, + 1 as in (4.9). 
Since p,d^,^, d* , K^, c^,(A2) and c*(A2) can depend on n, we allow the 
case c*(A2) — )• as long as Conditions (A) and (C) hold as stated. Thus, 
we allow p^ n but require that the model is sparse, in the sense that the 
number of nonzero coefficients d° is smaller than d* + K^,). For example, 
if c*(A2) 0(n~") for a small a > and c*(A2) ^ 0(1), then we require 
7 X 0(n^"/^) or greater, K* x 0(n") and d* /d° x 0(n") or greater. So all 
these quantities can depend on n, as long as the other requirements are met 
in Condition (C). 

By examining the Conditions (C)(ii) and (C)(iii), for standardized pre- 
dictors with ||xj|| = -y/n, we can have log(p/e) = o(n) 01 p = eexp(o(n)) as 
long as Condition (C)(ii) is satisfied. As in Zhang (2010), under a somewhat 
stronger version of Condition (C), Theorem 2 can be extended to quadratic 
spline concave penalties satisfying /3(t;Ai,7) = Afp(t/A;7) with a penalty 
function satisfying (d / dt) p{t; = 1 at t = 0+ and for t > 7. 

Also, comparing our results with the selection consistency results of Hebiri 
and van de Geer (2010) on the smoothed £1 +^2-penalized methods, our con- 
ditions tend to be weaker. Notably, Hebiri and van de Geer (2010) require an 
condition on the Gram matrix which assumes that the correlations between 
the truly relevant variables and those which are not are small. No such as- 
sumption is required for our selection consistency results. In addition, our 
selection results are stronger in the sense that the SLS estimator is not only 
sign consistent, but also equal to the oracle Laplacian shrinkage estimator 
with high probability. In general, similar results are not available with the 
use of the £1 penalty for sparsity. 

Theorem 2 shows that the SLS estimator automatically adapts to the 
sparseness of the p-dimensional model and the denseness of a true submodel. 
From a sparse p- model, it correctly selects the true underlying model O. 
This underlying model is a dense model in the sense that all its coefficients 
are nonzero. In this dense model, the SLS estimator behaves like the oracle 
Laplacian shrinkage estimator in (4.1). As in the convex penalized loss set- 
ting, here the results do not require a correct specification of a population 
correlation structure of the predictors. 

4.3. Unbiased Laplacian and variance reduction. There are two natural 
questions concerning the SLS. First, what are the benefits from introduc- 
ing the Laplacian penalty? Second, what kind of Laplacian L constitutes 
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a reasonable choice? Since the SLS estimator is equal to the oracle Lapla- 
cian estimator with high probability by Theorem 1 or 2, these questions can 
be answered by examining the oracle Laplacian shrinkage estimator (4.1), 
whose nonzero part is 

Without the Laplacian, that is, when A2 = 0, it becomes the least squares 
(LS) estimator 

If some of the predictors in {xj,j E O} are highly correlated or \0\ > n, 
the LS estimator /^^(O) is not stable or unique. In comparison, as discussed 
below Theorem 1, So(A2) = So + \2Lo can be a full rank matrix under 
a reasonable condition, even if the predictors in {xj,j € O} are highly cor- 
related or > n. 

For the second question, we examine the bias of f3Q{\2)- Since the bias 

of the target vector (4.4) is - /3o(A2) = \2T.~^^{\i)Lol3o^ ^oi^'i) is 
unbiased iff LoPo = 0. Therefore, in terms of bias reduction, a Laplacian L 
is most appropriate if the condition L^Pq = is satisfied. We shall say that 
a Laplacian L is unbiased if LoPq = 0. It follows from the discussion at the 
end of Section 4.1 that Lol^o = if /^fc = I^jg<^j3kiakik2 ■ ■ ■ ^k^k, where jg is 
a representative member of the clique VgCiO and {ki , . . . , km, k} O Vg Ci O . 
With an unbiased Laplacian, the mean square error of Pq{X2) is 

2 

E\\P"ai^2) - fBhf = ^trace(S0i(A2)SoS0'(A2)). 
The mean square error of (3o{0) is 

2 

^^ll/9o(0)-/3of = ^trace(S5i). 

We always have ^^^(As) -/3of < £;||^^(0) -/3^f for A2 > 0. Therefore, 
an unbiased Laplacian reduces variance without incurring any bias on the 
estimator. 

5. Laplacian shrinkage. The results in Section 4 show that the SLS esti- 
mator is equal to the oracle Laplacian shrinkage estimator with probability 
tending to one under certain conditions. In addition, an unbiased Lapla- 
cian reduces variance but does not increase bias. Therefore, to study the 
shrinkage effect of the Laplacian penalty on (3, we can consider the oracle 
estimator Pq. To simplify the notation and without causing confusion, in 
this section, we study some other basic properties of the Laplacian shrinkage 
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and compare it with the ridge shrinkage. The Laplacian shrinkage estimator 
is defined as 

(5.1) ^(Aa) = argmin|G(b; A2) = 7^||y - Xhf + ^Aab'Lb, b G M")- 

b I 2n 2 J 

The following proposition shows that the Laplacian penalty shrinks a co- 
efficient toward the center of all the coefficients connected to it. 

Proposition 1. Let f = y - xp. 
(i) 

A2 max dj\j3j — a'-/3/(i,-| < ||f|| < ||y||. 
i<j<ij 

(ii) 

X2\dj(3j - a'jP- (4/3fc -a'^^)| < i||xj -Xfc||||y||. 

Note that a'jP/dj = Ylk=i O'jA/dj = Yl=i ^Z'^{0'jk)\ajk\i3k/ dj is a signed 
weighted average of the /3fc's connected to j3j, since dj = \ ajk\- Part (i) 
of Proposition 1 provides an upper bound on the difference between Pj 
and the center of all the coefficients connected to it. When ||f ||/(A2dj) — )• 0, 
this difference converges to zero. For standardized dj = 1, part (ii) implies 
that the difference between the centered Pj and Pk converges to zero when 
- Xfc||||y||/(A2n) -^0. 

When there are certain local structures in the adjacency matrix A, shrink- 
age occurs at the local level. As an example, we consider the adjacency 
matrix based on partition of the predictors into 2r-balls defined in (3.1). 
Correspondingly, the index set { !,...,</} is divided into disjoint neighbor- 
hoods/cliques Vi, . . . ,Vj. We consider the normalized Laplacian L = Ig — A, 
where Iq is a q x q identity matrix and A = diag{Ai, ... ,Aj) with Ag = 
v~^l'gl. Here Vg = \Vg\, I < g < J. Let b^ = (6j, j G Vg)' . We can write the 
objective function as 

1 1 ^ 

(5.2) G(b; A2) = ^ lly - Xhf + -A2 K(^9 - v;%lg)hg. 

9=1 

For the Laplacian shrinkage estimator based on this criterion, we have the 
following grouping properties. 

Proposition 2. (i) For any j,k £Vg,l < g < J , 

h\Pj - Pk\ < -||xj - Xfcll • ||y||, j,ke Vg. 
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(ii) Let j3g he the average of the estimates in Vg. For any j & Vg and 
k£Vh, g^h, 

hlPj -Pg- {h - h)\ < ^l|xj - Xfcll • ||y||, 3 eVg,ke Vh. 

This proposition characterizes the smoothing effect and grouping property 
of the Laplacian penalty in (5.2). Consider the case ||y|P/?^ = 0(1)- Part (i) 
imphes that, for j and k in the same neighborhood and A2 > 0, the difference 
/3j — — )• if ||xj — Xfc||/(A2?T-^/2) 0. Part (ii) imphes that, for j and k in 

different neighborhoods and A2 > 0, the difference between the centered Pj 

and converges to zero if ||xj — Xk\\/{X2n^^'^) — )• 0. 

We now compare the Laplacian shrinkage and ridge shrinkage. The dis- 
cussion at the end of Section 4 about the requirement for the unbiasedness of 
Laplacian can be put in a wider context when a general positive definite or 
semidefinite matrix Q is used in the place of L. This wider context includes 
the Laplacian shrinkage and ridge shrinkage as special cases. Specifically, let 

1 ^ 1 

^q(A,7) = argmin — ||y - Xhf + ^p(|6j|; Ai,7) + -Aab'Qb. 

^ i=i 

For Q = Ip, Pq becomes the Mnet estimator [Huang et al. (2010a)]. With 
some modifications on the conditions in Theorem 1 or Theorem 2, it can be 
shown that /3q is equal to the oracle estimator defined as 

^^(A2) = argminj^lly - Xhf + ^b'Qb, h, = 0,i ^ o|. 

Then in a way similar to the discussion in Section 4, /3q is nearly unbiased 
iff QoPo ~ 0- Therefore, for \\I3q\\ / 0, Qo must be a rank deficient matrix, 
which in turn implies that Q must be rank deficient. Note that any Lapla- 
cian L is rank deficient. This rank deficiency requirement excludes the ridge 
penalty with Q = Ip. For the ridge penalty to yield an unbiased estimator, 
it must hold that \\f3°\\ =0 in the underlying model. 

We now give a simple example that illustrates the basic characteristics of 
Laplacian shrinkage and its differences from ridge shrinkage. 

Example 5.1. Consider a linear regression model with two predictors 
satisfying ||xj|p = 7i, j = 1,2. The Laplacian shrinkage and ridge estimators 
are defined as 

1 " 1 
(&li(A2),6l2(A2)) = argmin — ^^(yi - xnh - ^^262)^ + 7T-^2(&i - ^2)^ 
bub2 2n ^ 2 

and 

1 " 1 

(^m(A2),^fl2(A2)) = argmin — V'd/i - xnh - Xi2b2f + T,^2{h\ + hl). 

biM 2n ^ 2 
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Denote ri = cor(xi,y), r2 = cor(x2,y) and ri2 = cor(xi,X2). The Laplacian 
shrinkage estimates are 

I (1 + A2)ri - (ri2 - A2)r2 c- , , (1 + A2)r2 - (ri2 - A2)ri 

'''^^'^ - (l + A2)2-(n2-A2)2 ' - (l + A2)2-(n2-A2)2 • 

Let 

~ ri - ri2r2 - r2 - ri2ri ri + r2 

where (ftoisi, ^ois2) is the ordinary least squares (OLS) estimator for the 
bivariate regression, bL^oo) is the OLS estimator that assumes the two 
coefficients are equal, that is, it minimizes — {xn + Xi2)b)^. Let 

tfL = (2A2)/(1 — ri2 + 2A2). After some simple algebra, we have 

6li(A2) = (1 - ';^l)^o1s1 + WLbL{oo) 

and 

fcL2(A2) = (1 - WL)bols2 + WLbL{oo). 

Thus, for any fixed A2, 6l(A2) is a weighted average of 6ois and 6^(00) 
with the weights depending on A2. When A2 — )• 00, bLi — )• 6l(oo) and 6^2 — ^ 
6^(00). Therefore, the Laplacian penalty shrinks the OLS estimates toward 
a common value, which is the OLS estimate assuming equal regression co- 
efficients. 

Now consider the ridge regression estimator. We have 

t . (1 + A2)ri - ri2r2 a u i\ \ (1 + A2)r2 - ri2ri 
'^^^'^)= (l + A2)2-rf, '^^('^^= {l + X.?-rl, ■ 

The ridge estimator converges to zero as A2 — >• 00. For it to converge to 
a nontrivial solution, we need to rescale it by a factor of 1 + A2. Let wr = 
A/(l + A — r'l2)- Let 6„i = ri and 6^2 = '"2- Because ?^~^X]r=i^a ~ ^ ^^"^ 
2;?2 = 1, ri and r2 are also the OLS estimators of univariate re- 
gressions of y on xi and y on X2, respectively. We can write 

(1 + A2)6ri(A2) = CA2(1 - WR)bo\si + CxWrIuI, 
(1 + A2)6r2(A2) = CA2(1 - WR)bo\s2 + C\WRbu1, 

where ca^ = {(1 + X2? - (1 + A)rf2}/{(1 + \2? - r-?2}- Note that ca^ ~ 1. 
Thus, (1 + A2)&_R is a weighted average of the OLS and the univariate re- 
gression estimators. The ridge penalty shrinks the (rescaled) ridge estimates 
toward individual univariate regression estimates. 

6. Simulation studies. We use a coordinate descent algorithm to com- 
pute the SLS estimate. This algorithm optimizes a target function with 
respect to a single parameter at a time and iteratively cycles through all pa- 
rameters until convergence. This algorithm was originally proposed for cri- 
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terions with convex penalties such as LASSO [Fu (1998), Genkin, Lewis and 
Madigan (2004), Friedman et al. (2007), Wu and Lange (2008)]. It has been 
proposed to calculate the MCP estimates [Breheny and Huang (2011)]. De- 
tailed steps of this algorithm for computing the SLS estimates can be found 
in the technical report accompanying this paper [Huang et al. (2010b)]. 

In simulation studies, we consider the following ways of defining the adja- 
cency measure. (N.l) Ujk = I{rjk > and Sjk = 1. Here the cutoff r is compu- 
ted as 3.09 using the approach described in Section 3 with a p- value of 10~^; 
(N.2) Ujk = I{\rjk\ > r) and Sjk = sgn(rjfc). Here the cutoff r is computed 
as 3.29 using the approach described in Section 3 with a value of 10~^; 
(N.3) ajk = max(0, ^j^)" and sjk = 1- We set a = 6, which satisfies the scale- 
free topology criteria [Zhang and Horvath (2005)]; (N.4) ajk = r°^. and Sjk = 
sgn(rjfc). We set q = 6. 

The penalty levels Ai and A2 are selected using F-fold cross validation. 
In our numerical study, we set V = 5. To reduce computational cost, we 
search over the discrete grid of 2-- '"^'"''-^'''''^-^'---. For comparison, we also 
consider the MCP estimate and the approach proposed in Daye and Jeng 
(2009); referred to as D~J hereafter. Both the SLS and MCP involve the 
regularization parameter 7. For MCP, Zhang (2010) suggested using 7 = 
2/(1 — niSiKj^k\x'jXk\/n) for standardized covariates. The average 7 value 
of this choice is 2.69 in his simulation studies. The simulation studies in 
Breheny and Huang (2011) suggest that 7 = 3 is a reasonable choice. We 
have experimented with different 7 values and reached the same conclusion. 
Therefore, we set 7 = 3. 

We set n = 100 and p = 500. Among the 500 covariates, there are 100 
clusters, each with size 5. We consider two different correlation structures. 
(I) Covariates in different clusters are independent, whereas covariates i 
and j within the same cluster have correlation coefficients p'*"-''; and (II) co- 
variates i and j have correlation coefficients p'*"-'!. Under structure I, zero 
and nonzero effects are independent, whereas under structure II, they are 
correlated. Covariates have marginal normal distributions with mean zero 
and variance one. We consider different levels of correlation with p = 0.1, 0.5, 
0.9. Among the 500 covariates, the first 25 (5 clusters) have nonzero regres- 
sion coefficients. We consider the following scenarios for nonzero coefficients: 
(a) all the nonzero coefficients are equal to 0.5; and (b) the nonzero coeffi- 
cients are randomly generated from the uniform distribution on [0.25,0.75]. 
In (a), the Laplacian matrices satisfy the unbiasedness property L/3° = 
discussed in Section 4. We have experienced with other levels of nonzero 
regression coefficients and reached similar conclusions. 

We examine the accuracy of identifying nonzero covariate effects and the 
prediction performance. For this purpose, for each simulated dataset, we 
simulate an independent testing dataset with sample size 100. We conduct 
cross validation (for tuning parameter selection) and estimation using the 
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training set only. We then make prediction for subjects in the testing set 
and compute the PMSE (prediction mean squared error). 

We simulate 500 replicates and present the summary statistics in Table 1. 
We can see that the MCP performs satisfactorily when the correlation is 
small. However, when the correlation is high, it may miss a considerable 
number of true positives and have large prediction errors. The D-J approach, 
which can also accommodate the correlation structure, is able to identify 
all the true positives. However, it also identifies a large number of false 
positives, causing by the over-selection of the Lasso penalty. The proposed 
SLS approach outperforms the MCP and D-J methods in the sense that it 
has smaller empirical false discovery rates with comparable false negative 
rates. It also has significantly smaller prediction errors. 

6.1. Application to a microarray study. In the study reported in Scheetz 
et al. (2006), Fl animals were intercrossed and 120 twelve- week-old male 
offspring were selected for tissue harvesting from the eyes and microarray 
analysis using the Affymetric GeneChip Rat Genome 230 2.0 Array. The 
intensity values were normalized using the RMA [robust multi-chip averag- 
ing, Bolstad et al. (2003), Irizarry et al. (2003)] method to obtain summary 
expression values for each probe set. Gene expression levels were analyzed 
on a logarithmic scale. For the probe sets on the array, we first excluded 
those that were not expressed in the eye or that lacked sufficient variation. 
The definition of expressed was based on the empirical distribution of RMA 
normalized values. For a probe set to be considered expressed, the maximum 
expression value observed for that probe among the 120 F2 rats was required 
to be greater than the 25th percentile of the entire set of RMA expression 
values. For a probe to be considered "sufficiently variable," it had to exhibit 
at least 2-fold variation in expression level among the 120 F2 animals. 

We are interested in finding the genes whose expression are most variable 
and correlated with that of gene TRIM32. This gene was recently found to 
cause Bardet-Biedl syndrome [Chiang et al. (2006)], which is a genetically 
heterogeneous disease of multiple organ systems including the retina. One 
approach to find the genes related to TRIM32 is to use regression analysis. 
Since it is expected that the number of genes associated with gene TRIM32 
is small and since we are mainly interested in genes whose expression values 
across samples are most variable, we conduct the following initial screening. 
We compute the variances of gene expressions and select the top 1,000. We 
then standardize gene expressions to have zero mean and unit variance. 

We analyze data using the MCP, D-J, and proposed approach. In cross 
validation, we set V = 5. The numbers of genes identified are MCP: 23, D-J: 
31 (N.l), 41 (N.2), 34 (N.3), 30 (N.4), SLS: 25 (N.l), 26 (N.2), 16 (N.3) and 
17 (N.4), respectively. More detailed results are available from the authors. 
Different approaches and different ways of defining the adjacency measure 
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lead to the identification of different genes. As expected, the SLS identifies 
shorter lists of genes than the D-J, which may lead to more parsimonious 
models and more focused hypothesis for confirmation. As the proposed ap- 
proach pays special attention to the correlation among genes, we also com- 
pute the median of the absolute values of correlations among the identified 
genes, which are MCP: 0.171, D-J: 0.201 (N.l), 0.207 (N.2), 0.215 (N.3), 
0.206 (N.4), SLS: 0.247 (N.l), 0.208 (N.2), 0.228 (N.3), 0.212 (N.4). The 
D-J and SLS, which incorporate correlation in the penalty, identify genes 
that are more strongly correlated than the MCP. The SLS identified genes 
have slightly higher correlations than those identified by D-J. 

Unlike in simulation study, we are not able to evaluate true and false po- 
sitives. This limitation is shared by most existing studies. We use the follow- 
ing F-fold {V = 5) cross validation based approach to evaluate prediction, 
(a) Randomly split data into ^-subsets with equal sizes; (b) Remove one 
subset from data; (c) Conduct cross validation and estimation using the rest 
V — 1 subsets; (d) Make prediction for the one removed subset; (e) Repeat 
Steps (b)-(d) over all subsets and compute the prediction error. The sums 
of squared prediction errors are MCP: 1.876; D-J: 1.951 (N.l), 1.694 (N.2), 
1.534 (N.3) and 1.528 (N.4); SLS: 1.842 (N.l), 1.687 (N.2), 1.378 (N.3) and 
1.441 (N.4), respectively. The SLS has smaller cross validated prediction 
errors, which may indirectly suggest better selection properties. 

7. Discussion. In this article, we propose the SLS method for variable 
selection and estimation in high-dimensional data analysis. The most impor- 
tant feature of the SLS is that it explicitly incorporates the graph/network 
structure in predictors into the variable selection procedure through the 
Laplacian quadratic. It provides a systematic framework for connecting pe- 
nalized methods for consistent variable selection and those for network and 
correlation analysis. As can be seen from the methodological development, 
the application of the SLS variable selection is relatively independent of the 
graph/network construction. Thus, although graph/network construction is 
of significant importance, it is not the focus of this study and not thoroughly 
pursued. 

An important feature of the SLS method is that it incorporates the cor- 
relation patterns of the predictors into variable selection through the Lapla- 
cian quadratic. We have considered two simple approaches for determining 
the Laplacian based on dissimilarity and similarity measures. Our simula- 
tion studies demonstrate that incorporating correlation patterns improves 
selection results and prediction performance. Our theoretical results on the 
selection properties of the SLS are applicable to a general class of Laplacians 
and do not require the underlying graph for the predictors to be correctly 
specified. 
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We provide sufficient conditions under which the SLS estimator possesses 
an oracle property, meaning that it is sign consistent and equal to the ora- 
cle Laplacian shrinkage estimator with high probability. We also study the 
grouping properties of the SLS estimator. Our results show that the SLS is 
adaptive to the sparseness of the original p-dimensional model with n 
and the denseness of the underlying d°-dimensional model, where d° <n 
is the number of nonzero coefficients. The asymptotic rates of the penalty 
parameters are derived. However, as in many recent studies, it is not clear 
whether the penalty parameters selected using cross validation or other pro- 
cedures can match the asymptotic rate. This is an important and challenging 
problem that requires further investigation, but is beyond the scope of the 
current paper. Our numerical study shows a satisfactory finite-sample per- 
formance of the SLS. Particularly, we note that the cross validation selected 
tuning parameters seem sufficient for our simulated data. We are only able 
to experiment with four different adjacency measures. It is not our intention 
to draw conclusions on different ways of defining adjacency. More adjacency 
measures are hence not explored. 

We have focused on the linear regression model in this article. However, 
the SLS method can be applied to general linear regression models. Specifi- 
cally, for general linear models, the SLS criterion can be formulated as 



where £ is a given loss function. For instance, for generalized linear models 
such as logistic regression, we can take I to be the negative log-likelihood 
function. For Cox regression, we can use the negative partial likelihood as the 
loss function. Computationally, for loss functions other than least squares, 
the coordinate descent algorithm can be applied iteratively to quadratic 
approximations to the loss function. However, further work is needed to 
study theoretical properties of the SLS estimators for general linear models. 

There is a large literature on the analysis of network data and much 
work has also been done on estimating sparse covariance matrices in high- 
dimensional settings. See, for example, Zhang and Horvath (2005), Chung 
and Lu (2006), Meinshausen and Biihlmann (2006), Yuan and Lin (2007), 
Friedman, Hastie and Tibshirani (2008), Fan, Feng and Wu (2009), among 
others. It would be useful to study ways to incorporate these methods and 
results into the proposed SLS approach. In some problems such as genomic 
data analysis, partial external information may also be available on the 
graphical structure of some genes used as predictors in the model. It would 





i<j<fc<p 
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be interesting to consider approaches for combining external information 
on the graphical structure with existing data in constructing the Laplacian 
quadratic penalty. 

APPENDIX 

In this Appendix, we give proofs of Theorems 1 and 2 and Propositions 1 
and 2. 

Proof of Theorem 1. Since Cmin(A2) > I/7, the criterion (2.2) is 
strictly convex and its minimizer is unique. Let X = X{\2) = + A2L)^/^, 

y = y(A2) = X-iX'y and 

M(b;A,7) = (2n)-i||y-Xbf + J^p(|6,.|;Ai,7). 

i=i 

Since X'{X/n, y) = (E + A2L, X'y), M(b; A, 7) - M(b; A, 7) = (||y f - ||y f )/ 
(2n) does not depend on b. Thus, /3 is the minimizer of M(b; A, 7). 

Since |/3°| > 7A1 gives p'{\'^°\]\i) = 0, the KKT conditions hold for M(b; 
A, 7) at /3(A) = /3°(A) in the intersection of the events 

(A.l) ni = {\\X'o.{y-Xp°)/n\\^<\i], 172 = {minsgn(/3;)/3^°>7Ai}. 

Let e* =y- Xf3* = e + Ee* with e = y - ^y . Since X'y = X'y and both /3° 
and (3* are supported in O, 

X'sEr/n = X'sXp°/n - X'^Xp* /n 

(A.2) = Sb,o/3^ - Sb,o(A2)S-HA2)So/3^ 

= ^2{^B,o{M)'^~o{^2)Lo - Lb,o}Po^ 

which describes the effect of the bias of /3° on the gradient in the linear 
model y = XP* + s* . Since X'^Ee*/n = 0, we have \\X' Ee* /n\\oo = A2C2. 
Since X'e = X'y - EX'y = X'y - EX'y = X'e, (A.2) gives 

(A.3) ^i^{\\X'oce/n\\^<\i-\2C2}. 

Since (3* = EP", = J:^\X2)X'(^y /n can be written as f3o + {{vj/n)^/'^u'je, 
j € Oy , where ||uj || = 1 and {vj,j G O} are the diagonal elements of S^^(A2) x 
Eo{S-i(A2)}. Thus, 

(A.4) n',Q U {sgn(/3*)u;.£ > (n/z;,)V2|^*| > aV21og(|0|/e)}. 



SPARSE LAPLACIAN SHRINKAGE ESTIMATOR 23 



Since Ai > A2C2 + aY^21og(|j7e) maxj<p ||xj||/n, the sub-Gaussian Condi- 
tion (A) yields 



1 -P{Oinl72} < p|||X^ce/n||oo >cTY^21og((p- lOD/ejmaxllxjII/nj 
+ ^ P{sgn(/3;)u;.£ > a^2\og{\0\/e)] 

< 2\0^\e/{p -\0\) + \0\e/\0\ = 3e. 

The proof of (4.5) is complete, since 13° ^0 for all j £ O in 0,2- 
For the proof of (4.6), we have \\(3q — (3q\\oo = X2C1 due to 

(A.5) Ph-Po = ^oH>^2)^oPo -Po = -\2^o\^2)LoP'h- 

It follows that the condition on (3^, implies Condition (B)(iii) with sgn^fB"^) = 

sgn(/3^)=sgn(^^) in02. □ 

Proof of Theorem 2. For m > 1 and vectors u in the range of X, 
define 

C(v;m,0,A2) 
(A.6) ^ ^ 

= ^^E^f^l^Ml,0CBC{l,....p},\B\=m + \0\], 

where Pb = Xb{X'qXb)~^X'^. Here Q depends on A2 through P. Since /3(A) 
is the MC+ estimator based on data (-^^,5^) at penalty level Ai and (4.9) 
holds for S(A2) = X' X /n, the proof of Theorem 5 in Zhang (2010) gives 
^(A) = ^°(A) in the event O = H^^i ^j, where Oi = {||X^,(y-X/3°)/n||oo < 
Ai} is as in (A.l) and 



O2 = {rnmsgn(^;)/3° > 7(2^/?Ai)}, 



^3 = {C(y - Xf3*;d* -\OlO, A2) < Ai}. 

Note that (Ai^^, A2,e, As^^, a) in Zhang (2010) is identified with (Ai, 2\/c*Ai, Ai, 
1/2) here. 

Let §* = y - Xp* = e + Ee* with e = y- Ey. Since X'y = X'y, (A.2) 

still holds with \\X' Ee* /n\\oo = X2C2. Since X'e = X'y - EX'y = X'e, (A.2) 
still gives (A. 3). A slight modification of the argument for (A. 4) yields 

C U{sgn(^;)u;.s> (n/t;,)V2(|;3;| -7(2^/?Ai)) 

(A.7) 

>av/21og(|0|/e)}. 
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For \B\ < d*, we have \\PBEe*\\/^/n = ||S^^/^(A2)X^^e*||/n < 
\\X'sEe*/n\\o^y /\B\/c,{X2) and = ||S-'/'(A2)X^£||/n < 

\\X'j^e/n\\oo X VI^I/c*(A2). Thus, by (A.6) 

ay-Xf3*;d*-\0\,0,\2) = Ci£ + Ee*;d*-\0\,0,X2) 

^ i\\X's/n\\^ + X2C2)V¥ 
- y'{d*-\0\)c,{X2) 

Since \0\ < d* /{K^ + I), this gives 



(A.8) n3 0{\\X'e/n\\^<^/^X2)Kj{K7+l)Xi-X2C2}. 



Since max{l, c^{X2)K^/ {K^ + l)}Ai > A2C2 + (T\/2 log(p/ e) maxj <p ||xj||/n, 
(A.3), (A.7), (A.8) and Condition (A) imply 

1 - P{17i n rjg} + ^'{^^2} 

max ||xj ll/n > 
+ P{sgn(/3*)u;.£ > cTV21og(|0|/e)} 

<2p(6/p) + |0|e/|0|=3e. 

The proof of (4.10) is complete, since /3° 7^ for all j & O in f]2- We omit 
the proof of (4.11) since it is identical to that of (4.6). □ 

Proof of Proposition 1. The ^ satisfies 

(A.9) -^x;.(y-X^) + A2(dj/3,-a;.^) = 0, 1 < j < g. 

Therefore, by Cauchy-Schwarz and using ||xj|p = n, we have 

A2 max IdjBi — a'-/3| < — max |x'-(y — Xd)\ < ^^||f||. 
i<i<g ni<j<q 

Now because G(/3; A2) < G(0; A2), we have ||f|| < ||y||. This proves part (i). 
For part (ii), note that we have 

X2{dj^j - a'jP - {dkh - a'fcyS)) = ;^(xj - Xfc)'r. 

Thus 

X2\djj3j - a!j^ - {dkPk - ^kM < -Xfc||||f ||. 
Part (ii) follows. □ 
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Proof of Proposition 2. The (3 must satisfy 
(A.IO) -^^'j{y-Xp) + X2{^j-v;H'g~^g)=0, jeVg,i<g<J. 

Taking the difference between the jth and A;th equations in (A.IO) for 
j,ke Vg, we get 

- h) = -(xj - Xfc)'(y - XP), J, k e Vg. 
n 

Therefore, 

A2|/3j -/3fc| < ^||xj -Xfcll • ||y-X^||, j,keVg. 

Part (i) follows from this inequality. 

Define Pg = Vg^l'gPg. This is the average of the elements in 0^. For any 
j &Vg and k £ Vh, g ^ h, we have 

Wj - 4 - i^k - M) = h^j - xfc)'(y - x~^), j €Vg,ke v^. 

Thus, part (ii) follows. This completes the proof of Proposition 2. □ 
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